[CDF] to_cdf #519
base: master

Conversation
This adds comprehensive write support to the open_as_file() function with efficient memory management and streaming capabilities.

Key features:
- BufferedStream: SpooledTemporaryFile wrapper with chunked I/O (5MB memory threshold)
- Write modes: 'wb' (write), 'ab' (append) - binary only
- Adapter pattern: write_from_stream() method (opt-in for adapters)
- Compression support: .gz, .bz2, .xz files handled automatically
- Local files and S3 URIs supported via FSSpecAdapter
- Protocols for type safety: SupportsRead, SupportsWrite

Implementation details:
- read_from()/write_to() methods use shutil.copyfileobj for chunked copying
- Context manager pattern buffers writes and flushes on exit
- No breaking changes to existing read functionality
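A minimal sketch of how this write path might be used, assuming `open_as_file` gains a `mode` argument as this commit describes (the URI and payload are illustrative):

```python
from kloppy.io import open_as_file

# Illustrative usage of the write support described above: mode="wb"
# and S3 URIs follow this commit's description; compression is
# inferred from the .gz extension.
with open_as_file("s3://my-bucket/tracking.jsonl.gz", mode="wb") as f:
    f.write(b'{"frame_id": 1}\n')
```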
Cleaned up CDF Serializer
Added:

```python
from kloppy import cdf

dataset = cdf.load_tracking(
    meta_data="output/metadata.json",
    raw_data="output/tracking.jsonl.gz",
)
```

Full functionality including tests. The implementation includes a forced validation of the incoming data, using the common-data-format-validator. I can also make this validation optional, of course.
```python
    only_alive: Optional[bool] = True,
) -> TrackingDataset:
    """
    Load Common Data Format broadcast tracking data.
```
Does this only work for broadcast tracking data? I guess this is a copy-paste error.
```diff
@@ -0,0 +1,5 @@
+"""Functions for loading SkillCorner broadcast tracking data."""
```
| """Functions for loading SkillCorner broadcast tracking data.""" | |
| """Functions for loading data in the Common Data Format (CDF) standard.""" |
- I think it would be good to add a sentence explaining the CDF with a reference to the arXiv paper.
```python
base_coordinate_system: ProviderCoordinateSystem | None = None,
pitch_length: float | None = None,
pitch_width: float | None = None,
```
Should use Optional[T] here instead of T | None to be compatible with Python 3.9.
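For reference, a Python 3.9-compatible version of the parameters above (the enclosing signature is shown only for illustration):

```python
from typing import Optional

def load_tracking(  # hypothetical enclosing signature, for illustration
    base_coordinate_system: Optional["ProviderCoordinateSystem"] = None,
    pitch_length: Optional[float] = None,
    pitch_width: Optional[float] = None,
) -> None:
    ...
```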
```python
@property
def pitch_dimensions(self) -> PitchDimensions:
    return NormalizedPitchDimensions(
```
I think these should be MetricPitchDimensions.
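A hedged sketch of the suggested change, assuming kloppy's `MetricPitchDimensions` and `Dimension` take the usual min/max and pitch-size arguments, and that `pitch_length`/`pitch_width` are set:

```python
from kloppy.domain import Dimension, MetricPitchDimensions

@property
def pitch_dimensions(self) -> PitchDimensions:
    # Assumption: extents in metres, origin at the pitch centre
    return MetricPitchDimensions(
        x_dim=Dimension(-self.pitch_length / 2, self.pitch_length / 2),
        y_dim=Dimension(-self.pitch_width / 2, self.pitch_width / 2),
        pitch_length=self.pitch_length,
        pitch_width=self.pitch_width,
        standardized=False,
    )
```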
```python
raise KloppyParameterError(f"Engine {engine} is not valid")

def to_cdf(self):
    if self.dataset_type != DatasetType.TRACKING:
```
Remove the `if` here and override in `TrackingDataset`. A parent class should not be aware of its children.
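One way to apply this suggestion, as a minimal sketch with method bodies elided:

```python
class Dataset:
    def to_cdf(self):
        # Base class rejects by default; it needs no knowledge of subclasses
        raise NotImplementedError(
            "to_cdf is only supported for tracking datasets"
        )

class TrackingDataset(Dataset):
    def to_cdf(self):
        # Subclass provides the actual implementation
        serializer = CDFTrackingSerializer()
        ...
```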
```python
serializer = CDFTrackingSerializer()

# TODO: write files but also support non-local files, similar to how open_as_file supports non-local files
```
I think this is done now.
```python
@contextlib.contextmanager
def _write_context_manager(
    uri: str, mode: str
) -> Generator[BinaryIO, None, None]:
    """
    Context manager for write operations that buffers writes and flushes to adapter on exit.

    Args:
        uri: The destination URI
        mode: Write mode ('wb' or 'ab')

    Yields:
        A BufferedStream for writing
    """
    buffer = BufferedStream()
    try:
        yield buffer
    finally:
        adapter = get_adapter(uri)
        if adapter:
            adapter.write_from_stream(uri, buffer, mode)
        else:
            raise AdapterError(f"No adapter found for {uri}")
```
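For illustration, the buffered write path could be exercised like this (the URI and payload are hypothetical):

```python
# Writes are buffered by BufferedStream (in memory up to the 5MB
# threshold, then spooled to disk) and flushed to the matching
# adapter when the block exits.
with _write_context_manager("s3://my-bucket/out.jsonl.gz", "wb") as f:
    f.write(b'{"frame_id": 1}\n')
```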
This should be removed. Something went wrong in the merge with master. It's now duplicated.
Why do we need a new test file? Can't we simply use the existing SkillCorner test data?
| "pytest-httpserver", # Mock HTTP server for testing | ||
| "moto[s3,server]", # Mock AWS S3 for testing | ||
| "pre-commit>=4.2.0", # Git hooks for code quality checks | ||
| "common-data-format-validator==0.0.13", |
This should not be added as a dev dependency. Create a new optional dependency group instead.
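A sketch of what that could look like in pyproject.toml (the group name `cdf` is a placeholder):

```toml
[project.optional-dependencies]
cdf = [
    "common-data-format-validator==0.0.13",
]
```

Users would then opt in with something like `pip install kloppy[cdf]`.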
```python
pitch_dimensions=transformer.get_to_coordinate_system().pitch_dimensions,
frame_rate=frame_rate,
orientation=orientation,
provider=Provider.CDF,
```
I'm not sure whether this is desired. I believe the value of this attribute should be the actual data provider, not the data representation standard used. Doesn't the CDF store the actual data provider?
Can you also add some documentation on how to load CDF data and how to export a dataset to the CDF format?
This is a continuation of stephTchembeu#5 and #513. Due to all the misalignments, it was easier to create a new PR.
Overview
This PR adds `TrackingDataset.to_cdf()` functionality. It includes @koenvo's improved writing PR #515 and the cleaned-up version of the work done by @stephTchembeu. Basically, we can now output kloppy tracking data to the Common Data Format (Anzer et al. 2025).
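A hedged sketch of the export call (the output parameter names mirror the `cdf.load_tracking` example earlier in this thread; the exact `to_cdf` signature in this PR may differ):

```python
from kloppy import skillcorner

# Load tracking data from any supported provider...
dataset = skillcorner.load(
    meta_data="skillcorner_match_data.json",
    raw_data="skillcorner_structured_data.jsonl",
)

# ...and export it to the Common Data Format. The output paths and
# keyword names here are illustrative.
dataset.to_cdf(
    meta_data="output/metadata.json",
    raw_data="output/tracking.jsonl.gz",
)
```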
Because kloppy does not process some mandatory values for the CDF (stadium ID, competition ID, season ID, version (tracking), and collection timing), doing the above will throw some warnings.
We can resolve this by passing `additional_metadata` to the `to_cdf` functionality, using the Common Data Format Validator TypedDicts (you don't have to use this, but it helps keep everything in the correct schema), like so:
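(A hedged sketch: the key names below are illustrative guesses based on the fields listed above, not the validator's actual TypedDict schema.)

```python
# Illustrative only: key names are derived from the missing mandatory
# values named above (stadium ID, competition ID, season ID, tracking
# version, collection timing); consult the common-data-format-validator
# TypedDicts for the real schema.
dataset.to_cdf(
    meta_data="output/metadata.json",
    raw_data="output/tracking.jsonl.gz",
    additional_metadata={
        "stadium_id": "stadium-123",
        "competition_id": "competition-456",
        "season_id": "2024-2025",
        "version": "1.0.0",
        "collection_timing": "post-match",
    },
)
```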
This will no longer throw any warnings, and it should output the correct files.
Note: we set `only_alive=True`, because not doing so will also show a warning.

Common Data Format Validator
We have new unit tests that test the writing functionality, and tests that validate the output against the CDF schema using the common-data-format-validator. This is a development dependency. Note that if the CDF changes its structure, these tests will fail on the kloppy side too. I can imagine this is not ideal, but I'm not sure what to do about it. Any suggestions here are more than welcome.
Next Steps
I would like to continue with reading CDF tracking data, and with writing and reading CDF event data. Should I do this in a new PR, or shall I pile everything into this one?